Extracting Most Frequent Croatian Root Words Using Digram Comparison and Latent Semantic Analysis

Authors

  • Zvonimir Rados
  • Franjo Jovic
  • Josip Job
Abstract

A method for extracting root words from Croatian language text is presented. The described method is knowledge-free and can be applied to any language. It uses both morphological and semantic aspects of the language. The algorithm creates morpho-semantic groups of words and extracts a common root for every group. For morphological grouping, digram comparison is used to group words according to their morphological similarity. Latent semantic analysis is then applied to split the morphological groups into semantic subgroups of words. A root word is extracted from every morpho-semantic group. When the algorithm was applied to Croatian language text, the hundred most frequent root words it produced included 60 grammatically correct root words and 25 FAP (for all practical purposes) correct root words.
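The morphological grouping step can be illustrated with a minimal sketch. The abstract does not specify the similarity measure, the grouping procedure, or how the root is read off a group, so everything below is an assumption made for illustration: the Dice coefficient over character digrams, a greedy grouping against each group's first word, a threshold of 0.5, and the longest common prefix as a crude stand-in for the root. The function names are hypothetical, and the latent semantic analysis step that splits groups into semantic subgroups is omitted.

```python
def digrams(word):
    """Return the set of adjacent character pairs in a word."""
    w = word.lower()
    return {w[i:i + 2] for i in range(len(w) - 1)}


def dice(a, b):
    """Dice coefficient over digram sets: 2|A & B| / (|A| + |B|).

    Assumption: the paper's actual similarity measure is not given
    in the abstract; Dice is one common choice for digram comparison.
    """
    da, db = digrams(a), digrams(b)
    if not da or not db:
        return 0.0
    return 2 * len(da & db) / (len(da) + len(db))


def group_words(words, threshold=0.5):
    """Greedy grouping (assumed): each word joins the first group whose
    first word is similar enough, otherwise it starts a new group."""
    groups = []
    for w in words:
        for g in groups:
            if dice(w, g[0]) >= threshold:
                g.append(w)
                break
        else:
            groups.append([w])
    return groups


def common_root(group):
    """Longest common prefix of the group (assumed root heuristic)."""
    prefix = group[0]
    for w in group[1:]:
        while not w.startswith(prefix):
            prefix = prefix[:-1]
    return prefix


# Inflected forms of two Croatian nouns ("rijeka" = river, "grad" = town).
words = ["rijeka", "rijeke", "rijeci", "grad", "grada", "gradu", "gradovi"]
groups = group_words(words)
roots = [common_root(g) for g in groups]
print(groups)
print(roots)
```

On this toy input the two nouns fall into separate groups, and the prefix heuristic yields "rije" and "grad". The first is shorter than the linguistic stem because of the k/c alternation in "rijeci", which shows why a purely string-based root can be usable in practice without being grammatically exact.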


Similar resources

Query expansion based on relevance feedback and latent semantic analysis

Web search engines are one of the most popular tools on the Internet which are widely-used by expert and novice users. Constructing an adequate query which represents the best specification of users’ information need to the search engine is an important concern of web users. Query expansion is a way to reduce this concern and increase user satisfaction. In this paper, a new method of query expa...


Extracting Domain-Dependent Semantic Orientations of Latent Variables for Sentiment Classification

Sentiment analysis of weblogs is a challenging problem. Most previous work utilized semantic orientations of words or phrases to classify sentiments of weblogs. The problem with this approach is that semantic orientations of words or phrases are investigated without considering the domain of weblogs. Weblogs contain the author’s various opinions about multifaceted topics. Therefore, we have to ...


A Latent Topic Extracting Method based on Events in a Document and its Application

Recently, several latent topic analysis methods such as LSI, pLSI, and LDA have been widely used for text analysis. However, those methods basically assign topics to words, but do not account for the events in a document. With this background, in this paper, we propose a latent topic extracting method which assigns topics to events. We also show that our proposed method is useful to generate a ...


A Character-based Approach to Distributional Semantic Models: Exploiting Kanji Characters for Constructing Japanese Word Vectors

Many Japanese words are made of kanji characters, which themselves represent meanings. However traditional word-based distributional semantic models (DSMs) do not benefit from the useful semantic information of kanji characters. In this paper, we propose a method for exploiting the semantic information of kanji characters for constructing Japanese word vectors in DSMs. In the proposed method, t...


lsemantica: A Stata Command for Text Similarity based on Latent Semantic Analysis

The lsemantica command, presented in this paper, implements Latent Semantic Analysis in Stata. Latent Semantic Analysis is a machine learning algorithm for word and text similarity comparison. Latent Semantic Analysis uses Truncated Singular Value Decomposition to derive the hidden semantic relationships between words and texts. lsemantica provides a simple command for Latent Semantic Analysis ...




Publication date: 2005